CS 132 Data Exploration¶

Submitted by Group 17:

  • Alfonso, Francis Donald
  • Dizon, Julia Francesca
  • Paragas, Geri Angela

from CS 132 WFU

Dataset and Module Imports¶

We first import the necessary modules.

In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as po

po.init_notebook_mode()

We then import the dataset and store it in a dataframe named dataset, then copy it into tweets so that we can manipulate the data non-destructively.

In [ ]:
url = "https://raw.githubusercontent.com/jolyuh/cs132-grp17-portfolio/main/assets/dataset/Combined%20Dataset%20-%20Group%2017.xlsx"

dataset = pd.read_excel(url)
tweets = dataset.copy()
tweets.shape
Out[ ]:
(150, 35)

Dataset Preprocessing¶

Since the original dataset contains columns that are not needed for the data exploration, we drop those columns (namely ID, Group, Collector, Category, Topic, Reviewer, and Review). These columns exist only to help the researchers distinguish the samples at a meta level and are not needed for analysis.

In [ ]:
tweets.columns
Out[ ]:
Index(['ID', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category',
       'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
       'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
       'Rating', 'Reasoning', 'Remarks', 'Marcos supporter',
       'Duterte supporter', 'Explanation for the political stance', 'Reviewer',
       'Review'],
      dtype='object')
In [ ]:
tweets = tweets.drop(columns=['ID', 'Group', 'Collector', 'Category', 'Topic', 'Reviewer', 'Review'])
In [ ]:
tweets.shape
Out[ ]:
(150, 28)

Checking for missing data¶

Below is a summary of the dataframe information, which shows at a glance the number of non-null entries in each column. Because we have 150 samples, any column with fewer than 150 non-null entries has holes in our data. For our data pre-processing, then, we investigate these holes.

In [ ]:
tweets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 28 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Timestamp                             150 non-null    object        
 1   Tweet URL                             150 non-null    object        
 2   Keywords                              150 non-null    object        
 3   Account handle                        150 non-null    object        
 4   Account name                          149 non-null    object        
 5   Account bio                           114 non-null    object        
 6   Account type                          150 non-null    object        
 7   Joined                                150 non-null    datetime64[ns]
 8   Following                             150 non-null    int64         
 9   Followers                             150 non-null    int64         
 10  Location                              74 non-null     object        
 11  Tweet                                 150 non-null    object        
 12  Tweet Translated                      74 non-null     object        
 13  Tweet Type                            150 non-null    object        
 14  Date posted                           150 non-null    object        
 15  Screenshot                            150 non-null    object        
 16  Content type                          149 non-null    object        
 17  Likes                                 149 non-null    float64       
 18  Replies                               149 non-null    float64       
 19  Retweets                              149 non-null    float64       
 20  Quote Tweets                          149 non-null    float64       
 21  Views                                 4 non-null      object        
 22  Rating                                5 non-null      object        
 23  Reasoning                             150 non-null    object        
 24  Remarks                               122 non-null    object        
 25  Marcos supporter                      150 non-null    bool          
 26  Duterte supporter                     150 non-null    bool          
 27  Explanation for the political stance  149 non-null    object        
dtypes: bool(2), datetime64[ns](1), float64(4), int64(2), object(19)
memory usage: 30.9+ KB
In [ ]:
# This summarizes the columns that do have null values.
for col in tweets.columns[tweets.isna().any()]:
  print(col)
Account name
Account bio
Location
Tweet Translated
Content type
Likes
Replies
Retweets
Quote Tweets
Views
Rating
Remarks
Explanation for the political stance

However, while we should fill in holes where we can, not all values are required or available. For example, a Twitter account is not required to have a Location or an Account bio, which is why those columns have more null values than the others. The Views column is also largely null because the Views feature of Tweets only started rolling out in late December 2022, which covers only a small portion of the date range of our data. Thus, for our data clean-up, we only pay attention to null values that represent an actual lack of information needed for our research. These relevant columns are:

  • Account name
  • Content type
  • Likes
  • Replies
  • Retweets
  • Quote Tweets
  • Explanation for the political stance

To correct fields about Tweet or Twitter account info, we replace the NaN values with what is currently displayed on Twitter. For fields that require our own assessment, we fill them in with our own input as well.

In [ ]:
# tweets[tweets['Account name'].isna()]
# tweets.at[37, "Account name"]

tweets.at[37,'Account name']="Crux of the Matter 🕊🏃‍♀️🦅🏃‍♀️"
# tweets[tweets['Account name'].isna()]
In [ ]:
# tweets[tweets['Content type'].isna()]
# tweets.at[27, 'Content type']

tweets.at[27, 'Content type'] = "Emotional"
tweets[tweets['Content type'].isna()]
Out[ ]:
Timestamp Tweet URL Keywords Account handle Account name Account bio Account type Joined Following Followers ... Replies Retweets Quote Tweets Views Rating Reasoning Remarks Marcos supporter Duterte supporter Explanation for the political stance

0 rows × 28 columns

In [ ]:
# If you uncomment this you can see that the sample with NaN value (27)
# is the same across these four characteristics:
# tweets[tweets['Likes'].isna()]
# tweets[tweets['Replies'].isna()]
# tweets[tweets['Retweets'].isna()]
# tweets[tweets['Quote Tweets'].isna()]

# When checking out the original tweet, all four values are 0
tweets.at[27, 'Likes'] = 0
tweets.at[27, 'Replies'] = 0
tweets.at[27, 'Retweets'] = 0
tweets.at[27, 'Quote Tweets'] = 0
tweets[tweets['Quote Tweets'].isna()]
Out[ ]:
Timestamp Tweet URL Keywords Account handle Account name Account bio Account type Joined Following Followers ... Replies Retweets Quote Tweets Views Rating Reasoning Remarks Marcos supporter Duterte supporter Explanation for the political stance

0 rows × 28 columns

In [ ]:
# tweets[tweets['Marcos supporter'].isna()]
tweets.at[25, 'Marcos supporter'] = False
tweets.at[25, 'Duterte supporter'] = True
tweets.at[25, 'Explanation for the political stance'] = "Display name has fist emojis commonly associated with Duterte. Not enough data to show support for Marcos."
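After this kind of manual imputation, it is worth double-checking that no nulls remain in the columns we care about. Below is a minimal sketch of that pattern; the frame and the `relevant_cols` list are illustrative, not the actual dataset.

```python
import pandas as pd

# Illustrative frame: 'Location' is an optional column where nulls are allowed,
# while the columns in relevant_cols must be fully filled in after cleanup.
df = pd.DataFrame({
    'Account name': ['a', 'b'],
    'Likes': [3.0, 0.0],
    'Location': [None, 'Manila'],
})
relevant_cols = ['Account name', 'Likes']

# Total count of remaining nulls across the relevant columns
remaining = df[relevant_cols].isna().sum().sum()
print(remaining)  # 0 when the cleanup succeeded
```

Running the same check on the notebook's tweets frame with the seven relevant columns listed above would confirm the cleanup is complete.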

Ensuring consistent data formatting¶

Date posted column¶

During data visualization in a subsequent section, we found that not all of the values in the Date posted column were read by Pandas as Python datetime objects: those in the DD/MM/YY HH:MM format, rather than YYYY-MM-DD HH:MM:SS, were read as str objects instead.

In [ ]:
from datetime import datetime

string_dates = tweets[tweets['Date posted'].apply(lambda x: isinstance(x, str))]
datetime_dates = tweets[tweets['Date posted'].apply(lambda x: isinstance(x, datetime))]

print(f"Dates in str format: {string_dates.shape[0]}")
print(f"Dates in datetime format: {datetime_dates.shape[0]}")
Dates in str format: 30
Dates in datetime format: 120
In [ ]:
string_dates['Date posted'].head(5)
Out[ ]:
44    14/05/22 10:31
46    15/04/21 08:51
47    27/01/21 15:34
48    29/10/20 10:45
50    24/02/22 10:56
Name: Date posted, dtype: object
In [ ]:
datetime_dates['Date posted'].head(5)
Out[ ]:
0    2020-08-30 19:30:00
1    2020-08-23 20:12:00
2    2020-07-03 15:26:00
3    2018-03-05 13:16:40
4    2018-03-04 04:17:33
Name: Date posted, dtype: object

To fix this, we replace the original Date posted column with one in which each DD/MM/YY HH:MM formatted string is parsed into a datetime object.

In [ ]:
def get_date_slice(date):
  # Split a DD/MM/YY string into [day, month, year] integers
  return [int(x) for x in date.split('/')]

def get_time_slice(time):
  # Split an HH:MM string into [hour, minute] integers
  return [int(x) for x in time.split(':')]

def get_datetime_from_str(date_str):
  # Values that are already datetime objects are left untouched
  if isinstance(date_str, datetime):
    return date_str

  date_part, time_part = date_str.split(' ')
  date = get_date_slice(date_part)
  time = get_time_slice(time_part)

  # Two-digit years in the dataset all fall in the 2000s
  return datetime(2000 + date[2], date[1], date[0], time[0], time[1])

tweets['Date posted'] = tweets['Date posted'].map(get_datetime_from_str)
tweets['Date posted']
Out[ ]:
0     2020-08-30 19:30:00
1     2020-08-23 20:12:00
2     2020-07-03 15:26:00
3     2018-03-05 13:16:40
4     2018-03-04 04:17:33
              ...        
145   2021-02-11 16:48:00
146   2021-10-17 15:57:00
147   2021-10-17 00:43:00
148   2021-06-10 16:54:00
149   2021-05-08 08:03:00
Name: Date posted, Length: 150, dtype: datetime64[ns]

As a result of our processing, all of the values in the Date posted column are now datetime objects.

In [ ]:
tweets['Date posted'].apply(lambda x: isinstance(x, datetime)).describe()
Out[ ]:
count      150
unique       1
top       True
freq       150
Name: Date posted, dtype: object
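As an aside, the same conversion can be expressed with pandas' built-in parser by passing an explicit format string, which handles the two-digit-year logic for us. A minimal sketch; the sample values are illustrative strings in the dataset's DD/MM/YY HH:MM format.

```python
import pandas as pd

# Illustrative DD/MM/YY HH:MM strings, mirroring the dataset's format
raw = pd.Series(["14/05/22 10:31", "15/04/21 08:51"])

# %y maps two-digit years 00-68 to 2000-2068, matching the hand-rolled logic
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")
print(parsed.dt.year.tolist())  # [2022, 2021]
```

In the notebook itself the column mixes datetime and str values, so the hand-rolled function's isinstance check (or applying the parser only to the string rows) is still needed.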

Categorical data encoding¶

The only columns that require encoding are Marcos supporter and Duterte supporter.

In [ ]:
tweets['Marcos supporter'] = tweets['Marcos supporter'].replace({True: 1, False: 0})
tweets['Marcos supporter']
Out[ ]:
0      0
1      1
2      0
3      1
4      0
      ..
145    0
146    1
147    1
148    1
149    1
Name: Marcos supporter, Length: 150, dtype: int64
In [ ]:
tweets['Duterte supporter'] = tweets['Duterte supporter'].replace({True: 1, False: 0})
tweets['Duterte supporter']
Out[ ]:
0      1
1      1
2      1
3      1
4      1
      ..
145    1
146    1
147    1
148    1
149    1
Name: Duterte supporter, Length: 150, dtype: int64
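Since both columns are already booleans, the same 0/1 encoding can also be expressed more directly with astype. A minimal sketch on an illustrative Series, not taken from the dataset.

```python
import pandas as pd

# Illustrative boolean stance column
stance = pd.Series([True, False, True])

# Booleans cast directly to 0/1 integers
encoded = stance.astype(int)
print(encoded.tolist())  # [1, 0, 1]
```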

Exploring the numbers¶

With our dataset cleaned up, we can look at the distribution of values in the dataset.

In [ ]:
tweets.describe()
Out[ ]:
Joined Following Followers Date posted Likes Replies Retweets Quote Tweets Marcos supporter Duterte supporter
count 150 150.000000 150.000000 150 150.000000 150.000000 150.000000 150.000000 150.000000 150.00000
mean 2017-03-22 02:33:36 724.093333 1344.080000 2020-12-20 20:05:06.986666752 14.026667 1.093333 4.880000 0.906667 0.713333 0.88000
min 2006-08-01 00:00:00 0.000000 0.000000 2017-12-01 11:30:00 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
25% 2014-01-01 00:00:00 130.000000 64.000000 2020-05-31 09:16:30 0.000000 0.000000 0.000000 0.000000 0.000000 1.00000
50% 2018-01-30 12:00:00 327.000000 273.000000 2020-11-10 00:03:30 1.000000 0.000000 0.000000 0.000000 1.000000 1.00000
75% 2020-05-01 00:00:00 778.000000 882.250000 2022-05-17 22:37:45 4.000000 0.000000 1.000000 0.000000 1.000000 1.00000
max 2022-06-01 00:00:00 9381.000000 25419.000000 2022-12-30 20:35:00 531.000000 46.000000 203.000000 41.000000 1.000000 1.00000
std NaN 1253.035516 3433.864717 NaN 59.543109 5.046557 22.912416 4.849890 0.453719 0.32605

Visualizing the data¶

We then visualize a general overview of the tweets in our dataset based on our classification of whether their posters are Marcos or Duterte supporters.

Distribution of political stance¶

In [ ]:
marcos = tweets.query("`Marcos supporter` == 1").shape[0]
duterte = tweets.query("`Duterte supporter` == 1").shape[0]

marcos_duterte = tweets.query("`Marcos supporter` == 1 and `Duterte supporter` == 1").shape[0]
marcos_only = tweets.query("`Marcos supporter` == 1 and `Duterte supporter` == 0").shape[0]
duterte_only = tweets.query("`Marcos supporter` == 0 and `Duterte supporter` == 1").shape[0]
neither = tweets.query("`Marcos supporter` == 0 and `Duterte supporter` == 0").shape[0]

total = tweets.shape[0]
In [ ]:
pie_data = np.array([marcos_duterte, marcos_only, duterte_only, neither])
pie_labels = [
    "Marcos-Duterte",
    "Marcos only",
    "Duterte only",
    "Neither"
]

interactive_pie = go.Pie(labels=pie_labels, values=pie_data, pull=[0, 0, 0, 0.2], title="Posters' Political Leaning")
fig = go.Figure(data=interactive_pie)
fig.show()

Distribution of content type¶

In [ ]:
rational = tweets.query("`Content type` == \"Rational\"").shape[0]
emotional = tweets.query("`Content type` == \"Emotional\"").shape[0]
transactional = tweets.query("`Content type` == \"Transactional\"").shape[0]

content_type_data = np.array([rational, emotional, transactional])
content_type_labels = [
    "Rational",
    "Emotional",
    "Transactional",
]

content_type_counts = pd.DataFrame({
    'Content Type': content_type_labels,
    'No. of tweets': content_type_data
})

fig = px.bar(content_type_counts, x="Content Type", y="No. of tweets", title="Content Type of collected tweets")
fig.show()

It can be seen that the majority of the tweets collected were Emotional, with many of them also being replies to other tweets.

In [ ]:
emotional_tweets = tweets.query("`Content type` == 'Emotional'")

emotional_tweets[['Tweet', 'Content type', 'Tweet Type']]

reply_count = emotional_tweets[emotional_tweets['Tweet Type'].str.contains('Reply')].shape[0]

print(f"Number of Emotional tweets that are also replies: {reply_count}")
emotional_tweets[['Tweet', 'Content type', 'Tweet Type']]
Number of Emotional tweets that are also replies: 88
Out[ ]:
Tweet Content type Tweet Type
0 Kayo po pumatay d ang government. Huwag nyo k... Emotional Text, Reply
1 Kawawang kabataan,sinayang ang magandang kinab... Emotional Text, Video
2 Bakit namin kailangan protektahan ang aming mg... Emotional Text, Image
3 @anakbayan_ph\n😂😂😂\nTERORISTA pa more! Emotional Text, Image
4 @anakbayan_mm@ asaan na kayabangan ninyo na ka... Emotional Text
... ... ... ...
145 NPA Mga salot sa lipunan. Tanga nlang mga nani... Emotional Text, Reply
146 Its terrorism...\nAnakbayan is a legal front o... Emotional Text, Reply
147 Kabataan Partylist? Seriously? Isa yan sa mga ... Emotional Text, Reply
148 Kawawang mga NPA at DILAWANSHIT 🤣🤣🤣 Emotional Text, Reply
149 @pnppio @TeamAFP don’t stop the attacks on com... Emotional Text, Reply

113 rows × 3 columns

Distribution of date posted¶

In [ ]:
def all_quarters():
  ret = []
  years = [str(x) for x in range(2016, 2023)]
  quarters = ["Q1", "Q2", "Q3", "Q4"]
  for year in years:
    for qtr in quarters:
      ret.append(year + qtr)
  return ret

quarter_posted = pd.PeriodIndex(tweets["Date posted"], freq='Q')
tweets['Quarter posted'] = quarter_posted

quarter_counts = list(map(lambda qtr: (tweets['Quarter posted']==qtr).sum(), all_quarters()))
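The per-quarter tallies above could equivalently be computed with value_counts plus reindex, which avoids scanning the column once per quarter. A minimal sketch with illustrative dates, not the dataset's own:

```python
import pandas as pd

# Illustrative post dates: two in 2020Q2, one in 2022Q2
dates = pd.to_datetime(["2020-05-01", "2020-06-15", "2022-04-03"])
quarters = pd.PeriodIndex(dates, freq='Q')

# Count tweets per quarter, then align to the full 2016-2022 range,
# filling quarters with no tweets with zero
all_qtrs = pd.period_range("2016Q1", "2022Q4", freq='Q')
counts = pd.Series(quarters).value_counts().reindex(all_qtrs, fill_value=0)
print(counts[pd.Period("2020Q2", freq='Q')])  # 2
```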
In [ ]:
# Heatmap of when the collected tweets were posted

# Reshape the flat quarterly counts into a (years x quarters) grid,
# then transpose so that rows are quarters and columns are years
data = np.array([[quarter_counts[x] for x in range(i*4, i*4 + 4)] for i in range(7)])
data = data.T

fig = px.imshow(data,
                labels=dict(x="Year", y="Quarter", color="Tweets"),
                x=[str(x) for x in range(2016, 2023)],
                y=['Q1', 'Q2', 'Q3', 'Q4'],
                title="Distribution of 'Date posted' for tweets by quarter"
               )
fig.show()

By visualizing the distribution of post dates, we hoped to gain additional insight into the context behind increases or decreases in the number of red-tagging tweets posted during certain periods.

Of the 150 tweets collected, the greatest numbers of red-tagging tweets were posted during the 2nd Quarter of 2020 and the 2nd Quarter of 2022.

Though the scope is limited to the 150 tweets collected by the researchers, which may have been affected by biases in Twitter's search algorithm, the surges coincide with two major events:

  • 2nd Quarter of 2020: The first few months of the COVID-19 pandemic lockdown
  • 2nd Quarter of 2022: The 2022 Presidential Elections